PS 818 - Statistical Models
September 3, 2025
\[ \require{cancel} \DeclareMathOperator*{\argmin}{arg\,min} \]
\[ \DeclareMathOperator*{\argmax}{arg\,max} \]
Instructor: Anton Strezhnev
Logistics:
What is this course about?
Overall: An interest in learning and willingness to ask questions.
Assume a background in intro probability and statistics (1st year sequence)
Some prior knowledge of causal inference helpful but not critical
You should also be familiar with linear regression
If you want some review, check out chapters 1-6 of “Regression and Other Stories”
Week 2-4: Introduction to likelihood inference and GLMs
Week 5-7: Bayesian Inference and Multilevel Models
Week 8: Survey data
Week 9: Mixture Models and the EM algorithm
Week 10: Item response theory and ideal point models
Week 11-13: Flexible regression (ridge/lasso, forests, kernels)
Week 14: Semi-parametric theory
Week 15: Big regressions!
Discrete random variables take on a countable number of values (e.g. Bernoulli r.v. can take on 0 or 1) and have a probability mass function (PMF)
\[p(x) = Pr(X = x)\]
Continuous random variables take on an uncountable number of values (e.g. the Normal distribution on \((-\infty, \infty)\)).
\[Pr(X \in \mathcal{A}) = \int_{\mathcal{A}} f(x)dx\]
Remember: PMFs (and PDFs) sum (integrate) to \(1\) over the support of the random variable.
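As a quick numerical sanity check of this property, here is a minimal sketch (the Bernoulli parameter \(p = 0.4\) and the integration grid are arbitrary illustrative choices, not from the notes):

```python
import math

# Bernoulli(p) PMF: p(1) = p, p(0) = 1 - p  (p = 0.4 is an arbitrary choice)
p = 0.4
pmf = {0: 1 - p, 1: p}
total_mass = sum(pmf.values())
print(total_mass)  # sums to 1 over the support {0, 1}

# Standard normal PDF, integrated numerically over a wide interval
def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Simple midpoint rule on [-10, 10]; the mass in the tails beyond is negligible
n, a, b = 100_000, -10.0, 10.0
h = (b - a) / n
area = sum(normal_pdf(a + (i + 0.5) * h) for i in range(n)) * h
print(area)  # numerically very close to 1
```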
One important property of a random variable is its expectation \(\mathbb{E}[X]\). We’ll often make assumptions about the expectation of an R.V. while remaining agnostic about its true distribution.
\[\mathbb{E}[X] = \sum_{x \in \mathcal{X}} x Pr(X = x)\]
For continuous r.v. we have an integral
\[\mathbb{E}[X] = \int_{x \in \mathcal{X}} x f(x)\,dx\]
Fun fact (the “law of the unconscious statistician”): we can get the expectation of any function \(g(X)\) just by plugging it into the integral
\[\mathbb{E}[g(X)] = \int_{x \in \mathcal{X}} g(x) f(x)\,dx\]
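For a discrete r.v. the same recipe is a weighted sum over the support. A small sketch using a fair six-sided die as the illustrative r.v. (my choice, not from the notes), with exact arithmetic via `fractions`:

```python
from fractions import Fraction

# Fair six-sided die: PMF assigns 1/6 to each face (an illustrative discrete r.v.)
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g, pmf):
    """E[g(X)] = sum over the support of g(x) * Pr(X = x)."""
    return sum(g(x) * px for x, px in pmf.items())

EX = expectation(lambda x: x, pmf)       # E[X] = 7/2
EX2 = expectation(lambda x: x**2, pmf)   # E[X^2] = 91/6, by plugging g(x) = x^2 into the sum
print(EX, EX2)
```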
You’ll need to know some essential properties of expectations to simplify certain problems
Most important: linearity. For any two random variables \(X\) and \(Y\) and constants \(a\) and \(b\),
\[\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\]
Note that for a generic function \(g()\), \(\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\) in general. If \(g()\) is convex, Jensen’s inequality gives \(\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])\)
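A quick check of Jensen’s inequality with the convex function \(g(x) = x^2\) and a fair die as the illustrative r.v. (both are my choices for the example):

```python
from fractions import Fraction

# Fair die; g(x) = x^2 is convex, so Jensen gives E[g(X)] >= g(E[X])
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
EX = sum(x * px for x, px in pmf.items())       # E[X]   = 7/2
EgX = sum(x**2 * px for x, px in pmf.items())   # E[X^2] = 91/6
gEX = EX**2                                     # (E[X])^2 = 49/4

print(EgX >= gEX)     # True
print(EgX - gEX)      # the gap is exactly Var(X) = 35/12
```

The gap \(\mathbb{E}[X^2] - \mathbb{E}[X]^2\) is exactly the variance, which previews the identity below.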
For a binary r.v. \(X \in \{0, 1\}\), it’s helpful to remember the “fundamental bridge” between expectations and probability
\[\mathbb{E}[X] = Pr(X = 1)\]
We also care about the spread of a random variable: how far is a typical draw of \(X\) from its mean \(\mathbb{E}[X]\)? One measure of this is the variance.
\[Var(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\]
Also written as
\[Var(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2\]
The covariance measures how two random variables move together:
\[Cov(X, Y) = \mathbb{E}\left[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\right] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\]
Some useful variance rules. For any constant \(a\),
\[Var(aX) = a^2Var(X)\]
For any two random variables \(X\) and \(Y\),
\[Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\] \[Var(X - Y) = Var(X) + Var(Y) - 2Cov(X,Y)\]
If \(X\) and \(Y\) are independent (or merely uncorrelated), the covariance term drops out:
\[Var(X + Y) = Var(X) + Var(Y)\] \[Var(X - Y) = Var(X) + Var(Y)\]
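These identities can be verified exactly by enumerating a small joint distribution. A sketch using two independent fair dice (an illustrative choice) with exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice: exact joint PMF over all 36 outcomes
support = range(1, 7)
joint = {(x, y): Fraction(1, 36) for x, y in product(support, support)}

def E(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

VarX = E(lambda x, y: x**2) - E(lambda x, y: x)**2
VarY = E(lambda x, y: y**2) - E(lambda x, y: y)**2
CovXY = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
VarSum = E(lambda x, y: (x + y)**2) - E(lambda x, y: x + y)**2

print(CovXY)                              # 0: independence implies zero covariance
print(VarSum == VarX + VarY + 2 * CovXY)  # True

a = 3
Var_aX = E(lambda x, y: (a * x)**2) - E(lambda x, y: a * x)**2
print(Var_aX == a**2 * VarX)              # True: constants come out squared
```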
Key concept - Dependence and independence. If two variables are independent, the distribution of one does not change conditional on the other. We’ll write this using the \(\perp \!\!\! \perp\) notation.
For example, \(Y_i \perp \!\!\! \perp D_i\) implies
\[f(Y_i | D_i = 1) = f(Y_i| D_i = 0) = f(Y_i)\]
Two variables can be conditionally independent in that they are independent only when conditioning on a third variable. For example, we can have \(Y_i \cancel{\perp \!\!\! \perp} D_i\) but \(Y_i \perp \!\!\! \perp D_i | X_i\). This implies
\[f(Y_i| D_i = 1, X_i = x) = f(Y_i| D_i = 0, X_i = x) = f(Y_i | X_i =x)\]
Remember: Conditional independence does not imply independence or vice-versa!
A central object of interest in statistics is the conditional expectation function (CEF) \(\mathbb{E}[Y | X]\).
All the usual properties of expectations apply to conditional expectations. A particularly useful one is the law of iterated expectations:
\[\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y|X]]\]
Easiest to think about this in terms of discrete r.v.s
\[\mathbb{E}[Y] = \sum_{x \in \mathcal{X}} \mathbb{E}[Y | X = x] Pr(X = x)\]
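Here is a minimal sketch checking that identity on a small joint PMF (the particular values of \(X\), \(Y\), and the probabilities are invented for illustration):

```python
from fractions import Fraction

# A small joint PMF over (X, Y), chosen only for illustration
joint = {
    (0, 1): Fraction(1, 4), (0, 3): Fraction(1, 4),   # Pr(X=0) = 1/2, E[Y|X=0] = 2
    (1, 4): Fraction(1, 4), (1, 6): Fraction(1, 4),   # Pr(X=1) = 1/2, E[Y|X=1] = 5
}

# Direct computation: E[Y] = sum over (x, y) of y * Pr(X=x, Y=y)
EY = sum(y * p for (x, y), p in joint.items())

# Marginal PMF of X, then the conditional means E[Y | X = x]
px = {}
for (x, y), p in joint.items():
    px[x] = px.get(x, Fraction(0)) + p
cond_mean = {
    x: sum(y * p for (xx, y), p in joint.items() if xx == x) / px[x]
    for x in px
}

# Iterated: E[Y] = sum_x E[Y | X=x] * Pr(X=x)
EY_iterated = sum(cond_mean[x] * px[x] for x in px)
print(EY, EY_iterated)  # both equal 7/2
```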
How do we know if we’ve picked a good estimator? Will it be close to the truth? Will it be systematically higher or lower than the target?
We want to derive some of its properties. Consider the sample mean \(\hat{\mu} = \frac{1}{n}\sum_{i=1}^n Y_i\) of \(n\) i.i.d. draws with mean \(\mu\) and variance \(\sigma^2\).
Is the expectation of \(\hat{\mu}\) equal to \(\mu\)?
\[\mathbb{E}[\hat{\mu}] = \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n Y_i\right]\]
Next we use linearity of expectations
\[\frac{1}{n}\mathbb{E}\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Y_i\right]\]
Finally, under our i.i.d. assumption
\[\frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[Y_i\right] = \frac{1}{n}\sum_{i=1}^n \mu = \frac{n \mu}{n} = \mu\]
Therefore, the bias, \(\text{Bias}(\hat{\mu}) = \mathbb{E}[\hat{\mu}] - \mu = 0\)
What is the variance of \(\hat{\mu}\)? Again, start by pulling out the constant.
\[Var(\hat{\mu}) = Var\left[\frac{1}{n}\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right]\]
We can further simplify by using our i.i.d. assumption. The variance of a sum of i.i.d. random variables is the sum of the variances.
\[\frac{1}{n^2}Var\left[\sum_{i=1}^n Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right]\]
Finally, since the \(Y_i\) are identically distributed, each has the same variance \(\sigma^2\)
\[\frac{1}{n^2}\sum_{i=1}^n Var\left[Y_i\right] = \frac{1}{n^2}\sum_{i=1}^n \sigma^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]
Therefore, the variance is \(\frac{\sigma^2}{n}\)
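Both results can be illustrated by Monte Carlo simulation. A sketch with Bernoulli(\(p\)) draws (the values of \(p\), \(n\), the number of replications, and the seed are arbitrary choices for the example):

```python
import random
import statistics

# Monte Carlo check of E[mu_hat] = mu and Var(mu_hat) = sigma^2 / n
# for i.i.d. Bernoulli(p) draws; p, n, reps, and the seed are arbitrary
random.seed(818)
p, n, reps = 0.3, 50, 4000
sample_means = [
    sum(1 if random.random() < p else 0 for _ in range(n)) / n
    for _ in range(reps)
]

mu, sigma2 = p, p * (1 - p)  # Bernoulli mean and variance
print(statistics.mean(sample_means))      # close to mu = 0.3 (unbiasedness)
print(statistics.variance(sample_means))  # close to sigma2 / n = 0.0042
```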
As \(n\) gets large, what can we say about the estimator \(\hat{\mu}\)?
First, by the Law of Large Numbers, it is consistent – it converges in probability to the true parameter \(\mu\)
Second, by the Central Limit Theorem, its sampling distribution is approximately normal in large samples.
Consider the ordinary least squares estimator \(\hat{\beta}\) which solves the minimization problem:
\[\hat{\beta} = \argmin_b \ \sum_{i=1}^N (Y_i - X_ib)^2\]
We can do some algebra and find a closed form solution for this optimization problem
\[\hat{\beta} = (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\]
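In the one-regressor case with an intercept, \((\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}Y\) reduces to the familiar scalar formulas. A minimal sketch (the data values are invented so that OLS recovers the coefficients exactly):

```python
# Simple (one-regressor) least squares: slope = S_xy / S_xx,
# intercept = y_bar - slope * x_bar, the scalar form of (X'X)^{-1} X'Y
def ols_simple(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# On exactly linear data, OLS recovers the coefficients
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0 + 2.0 * xi for xi in x]  # intercept 1, slope 2, no noise
intercept, slope = ols_simple(x, y)
print(intercept, slope)  # 1.0 2.0
```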
Assumption 1: Linearity
\[Y = \mathbf{X}\beta + \epsilon\]
Assumption 2: Strict exogeneity of the errors
\[\mathbb{E}[\epsilon | \mathbf{X}] = 0\]
These two imply:
\[\mathbb{E}[Y|\mathbf{X}] = \mathbf{X}\beta = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \dots + \beta_kX_{k}\]
Best case: Our CEF is truly linear (by luck or we have a saturated model)
Usual case: We’re at least consistent for the best linear approximation to the CEF
Assumption 3: No perfect collinearity
This assumption is needed for identifiability – otherwise no unique solution to the least squares minimization problem exists!
Fails when one column can be written as a linear combination of the others
\[\begin{align*}\hat{\beta} &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}Y)\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}(\mathbf{X}\beta + \epsilon))\\ &= (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\mathbf{X})\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \end{align*}\]
\[\begin{align*} \mathbb{E}[\hat{\beta} | \mathbf{X}] &= \mathbb{E}\bigg[\beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) \bigg| \mathbf{X} \bigg]\\ &= \mathbb{E}[\beta | \mathbf{X}] + \mathbb{E}[(\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon) | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime} \mathbb{E}[\epsilon | \mathbf{X}]\\ &= \beta + (\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}0\\ &= \beta \end{align*}\]
Lastly, by law of total expectation
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\mathbb{E}[\hat{\beta}|\mathbf{X}]]\]
Therefore
\[\mathbb{E}[\hat{\beta}] = \mathbb{E}[\beta] = \beta\]
Consistency requires us to show the convergence of \((\mathbf{X}^{\prime}\mathbf{X})^{-1}(\mathbf{X}^{\prime}\epsilon)\) to \(0\) in probability as \(N \to \infty\).
But what have we not assumed? So far, nothing about the variance of the errors. The classical assumption is homoskedasticity with no correlation across observations:
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \mathbf{I}\]
Good news! We can relax homoskedasticity (but still keep no correlation) and do inference on the variance of \(\hat{\beta}\)
\[Var(\epsilon | \mathbf{X}) = \begin{bmatrix} \sigma^2_1 & 0 & \cdots & 0\\ 0 & \sigma^2_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_n \end{bmatrix}\]
“Robust” standard errors use the Eicker-Huber-White “sandwich” estimator, where \(\hat{\Sigma} = \text{diag}(\hat{\epsilon}_1^2, \dotsc, \hat{\epsilon}_n^2)\) is a diagonal matrix of squared residuals. It is consistent, but not unbiased, for the true sampling variance of \(\hat{\beta}\)
\[\widehat{Var(\hat{\beta})} = (\mathbf{X}^{\prime}\mathbf{X})^{-1} \mathbf{X}^{\prime}\hat{\Sigma}\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\]
One further classical assumption, normality of the errors, gives exact finite-sample inference:
\[\epsilon | \mathbf{X} \sim \mathcal{N}(0, \sigma^2\mathbf{I})\]
PS 818 - University of Wisconsin - Madison